sceptre part 2

I seek to answer four outstanding questions in this writeup:
I applied Fisher’s exact test to the negative control data within the framework of the undercover pipeline (using group size = 1). Below, I plot the resulting p-values on a negative log 10 transformed scale.
Amazingly, the p-values are calibrated far into the tail of the distribution on all of the datasets (with the exception of the Papalexi protein data, the Schraivogel perturb-seq data, and the Schraivogel TAP-seq data). Hence, confounding likely is negligible: if confounding were non-negligible, then the Fisher exact test, which is a test of marginal rather than conditional independence, would be miscalibrated. Seurat DE and SCEPTRE (not depicted) also are tests of marginal independence, and both are miscalibrated on the negative control data. Their miscalibration likely is not the result of confounding; rather, the assumptions that Seurat DE and SCEPTRE make about the null distribution of the test statistic (i.e., that the test statistic is Gaussian distributed or skew-normal distributed) likely are wrong for some fraction of the pairs.
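For concreteness, here is a minimal pure-Python sketch of the two-sided Fisher exact test as it would apply to a single pair, assuming the 2x2 table cross-tabulates the perturbation indicator against zero versus nonzero expression (the table construction and the function name `fisher_exact_p` are mine, for illustration only):

```python
from math import comb

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: perturbed vs. unperturbed cells; columns: nonzero vs. zero
    expression (an assumed construction, for illustration only).
    """
    n = a + b + c + d          # total number of cells
    row1 = a + b               # perturbed cells
    col1 = a + c               # cells with nonzero expression
    denom = comb(n, col1)
    # Hypergeometric probability of every achievable table with these margins
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    probs = {k: comb(row1, k) * comb(n - row1, col1 - k) / denom
             for k in range(lo, hi + 1)}
    p_obs = probs[a]
    # Two-sided p-value: total mass of tables no more probable than observed
    return min(1.0, sum(p for p in probs.values() if p <= p_obs * (1 + 1e-12)))

# A small table with a strong diagonal association
print(fisher_exact_p(2, 0, 0, 2))  # 1/3 (~0.333)
```

Note that when the gene is expressed in no cells (`col1 = 0`), the hypergeometric support collapses to a single point and the p-value is exactly one, which is relevant to the p = 1 phenomenon discussed later.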
Next, I plot the negative control p-values on an untransformed scale.
A nontrivial fraction of the p-values is exactly equal to one, causing the bulk of the distribution to be conservative. I investigate reasons for this in the next section.
Finally, I applied Fisher’s exact test to the (grouped) positive control data. The number of discoveries that Fisher’s exact test made after a Bonferroni correction is plotted below (the results of the other methods are plotted as well for reference). Clearly, Fisher’s exact test is quite capable of making discoveries. However, Fisher’s exact test does appear to be slightly less powerful than the other methods (aside from MIMOSCA); this is (at least in part) a result of the fact that Fisher’s exact test is the only calibrated method among those plotted. Fisher’s exact test also makes less efficient use of the count-valued data, as it reduces expression to a zero-versus-nonzero dichotomy.
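The Bonferroni step amounts to rejecting any pair whose p-value falls below alpha / m, where m is the number of pairs tested; a minimal sketch (the function name is mine):

```python
def bonferroni_discoveries(p_values, alpha=0.05):
    """Count rejections after a Bonferroni correction over all m pairs."""
    m = len(p_values)
    return sum(p <= alpha / m for p in p_values)

# With m = 5 pairs and alpha = 0.05, the per-pair cutoff is 0.01
print(bonferroni_discoveries([0.001, 0.04, 0.2, 0.9, 0.5]))  # 1
```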
The Fisher exact test yields a p-value exactly equal to one for a large fraction of the negative control pairs. What is the reason for this? I investigate the properties of pairs with p-values exactly equal to one. Most importantly, pairs with a Fisher exact p-value of 1 have a smaller median effective sample size than those with a p-value of less than 1.
median_ess_compare |>
  ggplot(mapping = aes(x = no, y = yes)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  theme_bw() +
  xlab("Median effective sample size for pairs with p = 1") +
  ylab("Median effective sample size for pairs with p < 1")
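The comparison underlying the plot can be sketched as follows, using toy (effective sample size, p-value) records (the data are hypothetical):

```python
from statistics import median

# Toy (effective sample size, Fisher p-value) records -- illustrative only
pairs = [(3, 1.0), (10, 1.0), (5, 1.0), (200, 0.02), (150, 0.4), (800, 1e-5)]

# Split pairs by whether the Fisher exact p-value is exactly one
ess_p1  = [ess for ess, p in pairs if p == 1.0]
ess_lt1 = [ess for ess, p in pairs if p < 1.0]

print(median(ess_p1), median(ess_lt1))  # 5 200
```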
All methods (with the exception of MIMOSCA) are miscalibrated on the Schraivogel perturb-seq data. Why is this the case? The original results are plotted below.
One possible explanation for the miscalibration that we observe is confounding. Gene carried out an analysis to assess the extent to which the perturbation indicators are confounded by the technical factors across datasets. Gene did not find any evidence of confounding on the Schraivogel perturb-seq data. Thus, confounding likely is not the culprit for the problems that we see here (unless, of course, there is unmeasured confounding; more on that below).
I hypothesized that variability among the NTCs might be responsible for the miscalibration that we observe. If some NTCs actually have an effect (i.e., if some NTCs are not true NTCs), and if others have no effect, then the undercover p-values contain some amount of signal. To investigate this hypothesis, I carried out an undercover analysis in which I randomly partitioned the NTCs into two equally sized groups of “control” and “undercover” gRNAs, “averaging out” and thus mitigating the effect of any non-true NTC(s). (In other words, I ran the undercover gRNA pipeline on the Schraivogel perturb-seq data, setting the group size equal to half the total number of NTCs.)
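The random partition of the NTCs can be sketched as follows (the function name and gRNA identifiers are hypothetical):

```python
import random

def split_ntcs(ntc_ids, seed=0):
    """Randomly partition the NTC gRNAs into equally sized
    'control' and 'undercover' groups (sketch; seed fixed for
    reproducibility)."""
    rng = random.Random(seed)
    shuffled = ntc_ids[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

control, undercover = split_ntcs([f"NTC_{i}" for i in range(10)])
print(len(control), len(undercover))  # 5 5
```

Averaging over a group of this size dilutes the contribution of any single non-true NTC, which is the mechanism the analysis relies on.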
The calibration of the p-values improves considerably. However, the methods still are miscalibrated. I suspect that variability among the NTCs is at least partially responsible, but the full story still is not clear.
Finally, we note that we are looking further into the tail of the Schraivogel perturb-seq dataset than (for example) the enhancer screen dataset, as the former contains more pairs. What happens when we downsample the perturb-seq dataset so that the perturb-seq dataset contains the same number of pairs as the TAP-seq dataset? We attempt to answer this using Fisher’s exact test.
The perturb-seq dataset is not all that different from the TAP-seq dataset (and in fact, the TAP-seq data likely are worse).
It also is possible that latent factors confound the gRNA assignments. To check, I first subsetted the data to cells containing NTCs only. I then ran a PCA on the gene expression matrices of the TAP-seq dataset and the perturb-seq dataset, extracting the top principal component from the first (24% variance explained) and the top two principal components from the second (28% variance explained). Finally, I regressed the perturbation indicators onto the principal components via logistic regression and obtained a p-value via a likelihood ratio test.
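A pure-Python sketch of the regression-plus-likelihood-ratio-test step for a single principal component, assuming a chi-square reference distribution with one degree of freedom (all names are mine; a real analysis would use `glm()` or an equivalent fitted routine):

```python
import math, random

def logistic_lrt_p(x, y, iters=25):
    """Likelihood ratio test for a single predictor in a logistic
    regression of y (0/1 perturbation indicator) on x (a principal
    component score). Newton-Raphson fit; illustrative sketch only."""
    n = len(y)
    ybar = sum(y) / n
    # Null model: intercept only; fitted probability is the mean of y
    ll_null = sum(yi * math.log(ybar) + (1 - yi) * math.log(1 - ybar)
                  for yi in y)
    # Full model: Newton-Raphson on (b0, b1)
    b0, b1 = math.log(ybar / (1 - ybar)), 0.0
    for _ in range(iters):
        mu = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
        g0 = sum(yi - mi for yi, mi in zip(y, mu))                  # score, intercept
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))    # score, slope
        w = [mi * (1 - mi) for mi in mu]                            # IRLS weights
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (-h01 * g0 + h00 * g1) / det
    mu = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
    ll_full = sum(yi * math.log(mi) + (1 - yi) * math.log(1 - mi)
                  for yi, mi in zip(y, mu))
    stat = 2 * (ll_full - ll_null)
    # Chi-square(df = 1) upper tail: P(Z^2 > s) = erfc(sqrt(s / 2))
    return math.erfc(math.sqrt(max(stat, 0.0) / 2))

rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(500)]
y = [1 if rng.random() < 0.3 else 0 for _ in range(500)]  # no true association
print(0.0 <= logistic_lrt_p(x, y) <= 1.0)  # True
```

Under no confounding, the perturbation indicator should be independent of the principal components and the resulting p-value should be unremarkable.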
We must determine a reasonable QC threshold to use in our analysis. If we choose the QC threshold too liberally (i.e., if we allow too many pairs to pass through the QC filter), then we risk putting ourselves in an “uninteresting” region of the problem space in which we are underpowered to make discoveries for pairs with low effective sample sizes. This not only wastes compute but also causes a reduction in power at the multiple testing correction step. By contrast, if we choose our QC threshold too stringently (i.e., if we allow too few pairs to pass through the QC filter), then we risk throwing out biologically interesting pairs that we might otherwise have rejected. An additional complication is that the “effective sample size” of a given pair (i.e., the number of cells with nonzero gene expression) is a function of both the gene and the gRNA. Thus, QC in single-cell CRISPR screen analysis must operate at the level of the gene-gRNA pair (instead of at the level of the gene and the gRNA separately).
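As a toy illustration of the pair-level bookkeeping, the following sketch counts nonzero-expression cells for one gene, along with one plausible refinement that also conditions on gRNA receipt (the data, and the refinement itself, are assumptions on my part):

```python
# Toy data: expression counts per cell for one gene, and a boolean
# indicator of which cells received a given gRNA (hypothetical data)
expression = [0, 3, 0, 1, 0, 7, 0, 0, 2, 0]
has_grna   = [True, True, False, True, False, False, True, False, True, False]

# Cells with nonzero expression of the gene
n_nonzero = sum(e > 0 for e in expression)

# One plausible pair-level refinement: nonzero-expression cells that
# also received the gRNA (this makes the quantity depend on both)
n_pair = sum(e > 0 and g for e, g in zip(expression, has_grna))

print(n_nonzero, n_pair)  # 4 3
```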
A reasonable strategy for selecting the QC threshold is as follows: locate the minimum effective sample size such that a well-calibrated method (e.g., Fisher’s exact test) consistently rejects positive control pairs whose effective sample size is (approximately) equal to that minimum. To determine this threshold empirically, I applied Fisher’s exact test, SCEPTRE, the Weissman method, and Seurat DE to the singleton (i.e., ungrouped) positive control data. (I did not apply MIMOSCA or the Schraivogel method, as these methods are computationally expensive. Furthermore, I did not apply the Liscovitch method, as it is badly miscalibrated and thus uninformative.) First, I plot the number of rejections that each method makes on each dataset.
Seurat DE and SCEPTRE appear to be the most powerful methods in general, with Seurat DE having a slight advantage over SCEPTRE. Next, I plot a histogram of the “effective sample size” of each pair, faceted by dataset.
Effective sample sizes for the (singleton) positive control pairs range from zero up to several thousand in most cases (and up to 10,000+ on the Schraivogel ground truth perturb-seq data).
Next, I plot the Fisher exact p-value vs. the effective sample size of each pair on each of the positive control datasets. Pairs with a p-value above 0.001 are colored red, and those with a p-value below 0.001 are colored blue. (The threshold of 0.001 was chosen somewhat arbitrarily; it is meant to be a proxy for a multiple testing correction threshold.) The vertical black line indicates the pair with the smallest effective sample size among those with a p-value of less than 0.001. Across datasets, we see that the minimum effective sample size required to reject a pair at the 0.001 level is about 500.
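The rule for locating the vertical line can be sketched as follows (toy data; the 0.001 cutoff is taken from the text):

```python
# Toy (effective sample size, p-value) pairs -- illustrative only
pairs = [(80, 0.5), (300, 0.002), (520, 1e-4), (900, 1e-8), (450, 0.03)]

# Smallest effective sample size among pairs rejected at the 0.001 level
rejected = [ess for ess, p in pairs if p < 0.001]
print(min(rejected))  # 520
```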
We create the same plot for Seurat DE. The results are similar to those of the Fisher exact test.
The apparent minimum effective sample size required to reject at a 0.001 level (i.e., $\approx$ 250 - 750) is surprisingly large.
We hypothesize that, as we apply increasingly stringent QC, the problem becomes easier, and thus calibration improves. We test that hypothesis here. First, we plot the results that we obtain using the default QC (i.e., filtering for genes that are expressed in > 0.005 of cells).
Next, we remove all pairs with an effective sample size of less than 250. This amounts to removing 30% of pairs. The calibration improves, especially on the sparse Frangieh data.
Finally, we restrict our attention to pairs that have an effective sample size exceeding 750. This amounts to removing 50% of (the original) pairs. Calibration improves further.
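The QC filter itself is simple; a sketch that also reports the fraction of pairs removed (the threshold is treated as inclusive, which is an assumption, and the function name is mine):

```python
def apply_ess_qc(pairs, threshold):
    """Remove pairs whose effective sample size falls below the threshold;
    return the retained pairs and the fraction of pairs removed."""
    kept = [p for p in pairs if p[0] >= threshold]
    return kept, 1 - len(kept) / len(pairs)

# Toy (effective sample size, pair id) records
pairs = [(100, "a"), (260, "b"), (500, "c"), (800, "d")]
kept, frac = apply_ess_qc(pairs, 250)
print(len(kept), frac)  # 3 0.25
```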
For better or for worse, when we restrict our attention to more “meaningful” regions of the parameter space, Seurat DE becomes a formidable competitor!